Audio Visual Interactions in Multimodal Communications
Abstract
Multimodal signal processing is more than simply "putting
together" text, audio, images and video; it is the integration and
interaction among these different media that creates new systems and new
research challenges and opportunities. Unimodal analysis of
signals can deliver acceptable performance levels only in benign
situations; the performance decreases rapidly when countermeasures are
taken. For example, person authentication systems used in
security, access control, and surveillance applications do not perform
well when subjects age, when the video resolution is inadequate, or when
lighting conditions are poor. Many of these difficulties can be
overcome by adding an audio signature to the video.
In multimodal communications where human speech is involved,
audio-visual interaction is particularly significant. Human
perception of speech is bimodal in that the way acoustic speech is heard can
be affected by visual cues from lip movement. Due to the bimodality of speech
perception, audio-visual interaction is an important design factor for
multimodal communication systems, such as video telephony and video
conferencing. A prime example of this interaction is lip or speech
reading. It is used not only by the hearing-impaired to enhance their
speech understanding but also, to some extent, by every person with normal
hearing, particularly in noisy environments.
One key issue in bimodal speech analysis and synthesis is the
establishment of the mapping between acoustic and visual
parameters. A novel approach for establishing this mapping was
developed during our previous funding period. Our current work
addresses two inter-related problems. First, the synthesis of
articulatory parameters for an MPEG-4 facial animation model is being
considered. Second, we are concerned with the task of robust speech
recognition. Fusing these two areas will impact the fields of very
low bit-rate coding of speech and images, speech- and text-driven facial
animation parameters, speech- and text-driven facial animation of
synthetic actors (i.e., avatars), and audio-visual speech recognition.
Students
- Jay Williams (Ph.D., June 2000)
- Zhilin Wu
- Petar Aleksic
Publications
- P. S. Aleksic and A. K. Katsaggelos, "Comparison of Low- and High-level Visual Features for Audio-Visual Continuous Automatic Speech Recognition," submitted for
publication, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
- Z. Wu, P. S. Aleksic, and A. K. Katsaggelos, "Inner Lip Feature Extraction for MPEG-4 Facial Animation," submitted for publication, International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), 2004.
- P. S. Aleksic and A. K. Katsaggelos, "An Audio-Visual Person Identification and Verification System Using FAPs as Visual Features," Workshop on Multimedia
User Authentication, Santa Barbara, California, December 2003.
- P. S. Aleksic and A. K. Katsaggelos, "Speech-to-Video Synthesis Using MPEG-4 Compliant Visual Features," IEEE Transactions on Circuits and Systems for Video
Technology: Special Issue on Audio and Video Analysis for Multimedia Interactive Services, accepted for publication, February 2004.
- P. S. Aleksic and A. K. Katsaggelos, "Speech-to-Video Synthesis Using Facial Animation Parameters," International Conference on Image Processing (ICIP),
Barcelona, Spain, September 2003.
- P. S. Aleksic and A. K. Katsaggelos, "Product HMMs for Audio-Visual Continuous Speech Recognition Using Facial Animation Parameters," International Conference
on Multimedia and Expo (ICME), Baltimore, July 2003.
- P. S. Aleksic, J. J. Williams, and A. K. Katsaggelos, "Speech-to-Video Synthesis Using MPEG-4 Compliant Visual Features," 4th European Workshop on Image
Analysis for Multimedia Interactive Services (WIAMIS), London, April 2003.
- P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features," EURASIP Journal on
Applied Signal Processing, Special Issue on Joint Audio-Visual Speech Processing, vol. 2002, no. 11, pp. 1213-1227, November 2002.
- A. K. Katsaggelos and P. S. Aleksic, "Audio-Visual Interaction in Multimedia Communications," Proceedings of International Telecommunications Conference, pp. 47-52, Santa Rita do Sapucai, Brazil, October 2002.
- Z. Wu, P. S. Aleksic, and A. K. Katsaggelos, "Lip Tracking for MPEG-4 Facial Animation," International Conference on Multimodal Interfaces (ICMI), pp.
293-298, Pittsburgh, October 2002.
- P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-Visual Continuous Speech Recognition Using MPEG-4 Compliant Visual Features," International
Conference on Image Processing (ICIP), pp. 960-963, Rochester, NY, September 2002.
- P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-Visual Continuous Speech Recognition Using MPEG-4 Compliant Visual Features," Defense Advanced Research Projects Agency (DARPA) Multimodal Speech Recognition Workshop, N.C. A&T State University, June 2002.
- J. J. Williams, A. K. Katsaggelos, and M. A. Randolph, "A Hidden
Markov Model Based Visual Speech Synthesizer," Proceedings
of the IEEE International Conference on Acoustics, Speech and Signal
Processing, Istanbul, Turkey, June 5-9, 2000.
- J. J. Williams, J. C. Rutledge, D. C. Garstecki, and A. K.
Katsaggelos, "Frame Rate and Viseme Analysis for Multimedia
Applications," Journal of VLSI Signal Processing Systems,
vol. 23, nos. 1/2, pp. 7-23, Oct. 1998.
- J. J. Williams, J. C. Rutledge, D. C. Garstecki, and A. K.
Katsaggelos, "Frame Rate and Viseme Analysis for Multimedia
Applications," Proc. IEEE First Workshop on Multimedia Signal
Processing, pp. 13-18, Princeton, NJ, June 23-25, 1997.

Theses
- J. J. Williams, "Speech-to-Video Conversion for Individuals
with Impaired Hearing," Ph.D. Thesis, Department of Electrical
and Computer Engineering, Northwestern University, June 2000.
